plot_digit <- function(row) {
  # Drop the label column, convert the pixel values to a numeric
  # vector, and reshape into a 28 x 28 matrix
  digit_mat <- row |>
    select(-digit) |>
    as.numeric() |>
    matrix(nrow = 28)
  # Reverse the columns so the digit is drawn right side up
  image(digit_mat[, 28:1])
}
Supervised Machine Learning
Henry Black, Nhi Luong, James Le
October 14, 2025
calc_prop <- function(region, row) {
  # Take a row from mnist and transform it into a 28 x 28 "digit" matrix
  digit_mat <- row |>
    as.numeric() |>
    matrix(nrow = 28) |>
    t()
  # Find positions of pixels belonging to "region"
  pos <- (region == 1)
  # Subset "digit" to those positions and flag dark pixels (grey > 20)
  dark <- digit_mat[pos] > 20
  # Return the proportion of dark pixels of the image within "region"
  return(sum(dark) / sum(pos))
}
Here we define the two regions that we will use to predict the correct digit.
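The region definitions themselves are folded away; a minimal sketch of how two such 0/1 masks could be built. The specific bands chosen below are hypothetical, not the report's actual regions:

```r
# Hypothetical region masks: 28 x 28 matrices of 0s and 1s.
# region1 marks a band near the top of the image,
# region2 a band near the bottom (illustrative choices only).
region1 <- matrix(0, nrow = 28, ncol = 28)
region1[5:9, 6:22] <- 1

region2 <- matrix(0, nrow = 28, ncol = 28)
region2[20:24, 6:22] <- 1
```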
Here we subset mnist to include only our two digits of interest and 1000 observations.
In this section, we create a dataset called ‘features_tbl’ that includes four columns (id, digit, prop1, and prop2). Column ‘id’ contains consecutive numbers from 1 to 1000. ‘digit’ is a factor corresponding to digit 3 or digit 4. We calculated the proportion of dark pixels for ‘region1’ and saved the values in column ‘prop1’, then did the same calculation for ‘region2’ and saved the values in column ‘prop2’.
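The construction code is folded; a sketch of how features_tbl could be assembled with calc_prop, assuming mnist3_4 holds the 1000 filtered rows, the pixel columns share a "pixel" prefix, and region1/region2 are the 0/1 masks (a rowwise approach is one of several ways to do this):

```r
library(dplyr)

# calc_prop(), region1, and region2 are defined earlier in the report;
# mnist3_4 stands in for the filtered 1000-row dataset of 3s and 4s.
features_tbl <- mnist3_4 |>
  mutate(id = row_number(), digit = factor(digit)) |>
  rowwise() |>
  mutate(
    # Dark-pixel proportion of each image within each region
    prop1 = calc_prop(region1, c_across(starts_with("pixel"))),
    prop2 = calc_prop(region2, c_across(starts_with("pixel")))
  ) |>
  ungroup() |>
  select(id, digit, prop1, prop2)
```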
Here we generate a plot of prop1 versus prop2 and use color to represent the two digit types.
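The plotting code is folded; a minimal ggplot2 sketch, assuming features_tbl from the previous step:

```r
library(ggplot2)

# Scatter prop1 against prop2, colored by the true digit
ggplot(features_tbl, aes(x = prop1, y = prop2, color = digit)) +
  geom_point(alpha = 0.6) +
  labs(x = "prop1 (dark-pixel proportion in region 1)",
       y = "prop2 (dark-pixel proportion in region 2)",
       color = "Digit")
```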
Here, we use our features dataset (see section Feature Dataset) to create our training and testing datasets.
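A sketch of the split step with rsample. The split proportion is not stated in the text, but the test-set confusion matrix later totals 200 of the 1000 rows, which is consistent with an 80/20 split:

```r
library(tidymodels)

set.seed(12345)
# Hold out 20% of the feature rows, stratified on the digit label
features_split <- initial_split(features_tbl, prop = 0.8, strata = digit)
features_train_tbl <- training(features_split)
features_test_tbl  <- testing(features_split)
```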
We then create our cross validation data set using the default value of 10 folds. Additionally we create a grid which contains both the values of k that we wish to test, along with the 4 different weight functions that we want to test in combination with our k values.
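The resampling and grid code is folded; a sketch using the object names that appear in the tuning call below (features_cv, features_neighbors). The exact k values tried are only partially visible in the results, so the sequence here is illustrative:

```r
library(tidymodels)

set.seed(12345)
# 10-fold cross-validation on the training data
features_cv <- vfold_cv(features_train_tbl, v = 10, strata = digit)

# Candidate k values crossed with the four weight functions
# (the k range is an assumption, not the report's exact grid)
features_neighbors <- crossing(
  neighbors = 1:50,
  weight_func = c("rectangular", "triangular", "inv", "cos")
)
```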
Next, we create our workflow and tune it over both k and the weight function.
We tuned both parameters at the same time to account for the possibility that the optimal k for one weight function differs from the optimal k for another.
Rectangular: The most basic of the four. It assigns every one of the nearest neighbors a weight of 1, so all neighbors count equally, regardless of how far they are from the unknown point.
Triangular: Neighbors are weighted linearly by distance: as you move away from the point of interest, the weight assigned to neighboring data points decreases toward zero.
Inverse: Similar in spirit to the triangular weight function in that closer neighbors receive higher weight, but it achieves this differently: each neighbor is given a weight of \(\frac{1}{d}\), where \(d\) is that neighbor's distance from the unknown point. Neighbors very close to the unknown therefore receive much higher weight than ones farther away.
Cos: Similar to triangular, but the distance is transformed with a cosine to produce the weight, so the weight follows a cosine curve: closer points are weighted more than points farther away.
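The shapes of these four kernels can be sketched directly in base R. These are the standard textbook forms; kknn applies additional distance normalization and scaling internally, so treat this as illustrative:

```r
# Kernel weight as a function of normalized distance d in [0, 1]
kernel_weight <- function(d, kernel = c("rectangular", "triangular", "inv", "cos")) {
  kernel <- match.arg(kernel)
  switch(kernel,
    rectangular = rep(1, length(d)),  # all neighbors weighted equally
    triangular  = 1 - abs(d),         # linear decay to 0 at d = 1
    inv         = 1 / abs(d),         # hyperbolic decay with distance
    cos         = cos(d * pi / 2)     # cosine decay to 0 at d = 1
  )
}

d <- c(0.1, 0.5, 0.9)
kernel_weight(d, "triangular")  # 0.9 0.5 0.1
```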
features_recipe <-
  recipe(formula = digit ~ prop1 + prop2, data = features_train_tbl) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

features_spec <-
  nearest_neighbor(neighbors = tune(), weight_func = tune()) |>
  set_mode("classification") |>
  set_engine("kknn")

features_workflow <-
  workflow() |>
  add_recipe(features_recipe) |>
  add_model(features_spec)

set.seed(12345)
features_tune <-
  tune_grid(features_workflow,
            resamples = features_cv,
            grid = features_neighbors)
autoplot(features_tune, metric = "accuracy")

# A tibble: 8 × 8
neighbors weight_func .metric .estimator mean n std_err .config
<int> <chr> <chr> <chr> <dbl> <int> <dbl> <chr>
1 6 rectangular accuracy binary 0.952 10 0.00667 pre0_mod07_post0
2 17 triangular accuracy binary 0.952 10 0.00717 pre0_mod16_post0
3 17 inv accuracy binary 0.951 10 0.00778 pre0_mod14_post0
4 11 triangular accuracy binary 0.951 10 0.00631 pre0_mod12_post0
5 17 cos accuracy binary 0.95 10 0.00791 pre0_mod13_post0
6 28 inv accuracy binary 0.95 10 0.00722 pre0_mod22_post0
7 33 inv accuracy binary 0.95 10 0.00722 pre0_mod26_post0
8 39 inv accuracy binary 0.95 10 0.00722 pre0_mod30_post0
The tuning results show that the optimal k is 6 and the optimal weight function is the rectangular weight function.
Next, we need to finalize our workflow and fit our model.
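The finalization code is folded; a sketch using tune's helpers, assuming the tuned object features_tune and the training data from earlier:

```r
library(tidymodels)

# Pick the best (k, weight function) pair by cross-validated accuracy
features_best <- select_best(features_tune, metric = "accuracy")

# Plug the winning parameters into the workflow and fit on the
# full training set
features_final_fit <-
  features_workflow |>
  finalize_workflow(features_best) |>
  fit(data = features_train_tbl)
```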
# A tibble: 5 × 8
neighbors weight_func .metric .estimator mean n std_err .config
<int> <chr> <chr> <chr> <dbl> <int> <dbl> <chr>
1 6 rectangular accuracy binary 0.952 10 0.00667 pre0_mod07_post0
2 17 triangular accuracy binary 0.952 10 0.00717 pre0_mod16_post0
3 17 inv accuracy binary 0.951 10 0.00778 pre0_mod14_post0
4 11 triangular accuracy binary 0.951 10 0.00631 pre0_mod12_post0
5 17 cos accuracy binary 0.95 10 0.00791 pre0_mod13_post0
# A tibble: 1 × 3
neighbors weight_func .config
<int> <chr> <chr>
1 6 rectangular pre0_mod07_post0
Finally, we can calculate our accuracy and view our confusion matrix to see how well our model works.
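A sketch of this evaluation step, assuming the fitted workflow (here called features_final_fit) and the held-out test set:

```r
library(tidymodels)

# Attach class predictions and probabilities to the test set
features_preds <- augment(features_final_fit, features_test_tbl)

# Accuracy and Cohen's kappa
features_preds |>
  metrics(truth = digit, estimate = .pred_class)

# Confusion matrix of predicted vs true digits
features_preds |>
  conf_mat(truth = digit, estimate = .pred_class)
```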
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.95
2 kap binary 0.899
Truth
Prediction 4 3
4 104 10
3 0 86
Our final accuracy is 95%: we correctly classify 86 out of 96 3’s and 104 out of 104 4’s in the test set.
We are now interested in the few numbers that our model misclassified. Interestingly, our model had a 100% true negative rate; in other words, it predicted every true 4 correctly. However, it did miss a fair number of the 3s, so we will now plot some of them and attempt to understand why the model misclassified them.
Misclassification reasons:
These digits were misclassified because of unusual handwriting. In the first plot, the handwritten 3 is tilted and misses the top and part of the bottom region. In the second plot, the handwritten digit misses virtually the entire top region. Poor handwriting with an angled orientation can overlap the regions we selected in unexpected ways.
# A tibble: 2 × 3
id prop1 prop2
<int> <dbl> <dbl>
1 386 0.233 0.487
2 427 0.367 0.179
# A tibble: 2 × 4
id .pred_class .pred_3 .pred_4
<int> <fct> <dbl> <dbl>
1 386 4 0.333 0.667
2 427 4 0 1
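A sketch of how these misclassified test digits could be pulled out and drawn with the plot_digit helper defined earlier, assuming the prediction tibble from the evaluation step and that mnist3_4 still holds the original pixel rows indexed by id:

```r
library(dplyr)

# Rows the model got wrong (here, true 3s predicted as 4s)
misses <- features_preds |>
  filter(.pred_class != digit)

# Inspect their features and predicted class probabilities
misses |>
  select(id, prop1, prop2, .pred_class, .pred_3, .pred_4)

# Draw the underlying images for visual inspection
for (i in misses$id) {
  plot_digit(mnist3_4[i, ])
}
```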
We took the 5s from the mnist dataset, saved them under ‘mnist_5’, and then bound them by rows with the previous ‘mnist3_4’ dataset. The resulting dataset is called ‘mnist3_4_5’.
mnist345_features_recipe <-
  recipe(formula = digit ~ prop1 + prop2, data = mnist345_features_train)

mnist345_features_spec <-
  nearest_neighbor() |>
  set_mode("classification") |>
  set_engine("kknn")

mnist345_features_workflow <-
  workflow() |>
  add_recipe(mnist345_features_recipe) |>
  add_model(mnist345_features_spec)

# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.677
2 kap multiclass 0.515
# A tibble: 6 × 4
digit .metric .estimator .estimate
<fct> <chr> <chr> <dbl>
1 4 accuracy multiclass 0.860
2 3 accuracy multiclass 0.694
3 5 accuracy multiclass 0.485
4 4 kap multiclass 0
5 3 kap multiclass 0
6 5 kap multiclass 0
Truth
Prediction 4 3 5
4 80 4 9
3 6 75 27
5 20 31 48
The confusion matrix shows that digits 3 and 5 are confused most often, with 35 and 36 misclassifications respectively. Digit 3 is misclassified as 5 the most, and 5 is misclassified as 3 the most, because 3 and 5 share similar shapes in the areas we chose as region 1 and region 2: they both overlap at the top and bottom regions.
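The three-class fit and evaluation follow the same pattern as the two-class model; a sketch assuming a train/test split of the mnist3_4_5 features (the split object names here are illustrative):

```r
library(tidymodels)

# Fit the default-k workflow on the three-class training data
mnist345_fit <- fit(mnist345_features_workflow,
                    data = mnist345_features_train)

# Predict on the held-out test set and evaluate
mnist345_preds <- augment(mnist345_fit, mnist345_features_test)

# Multiclass accuracy and kappa
mnist345_preds |>
  metrics(truth = digit, estimate = .pred_class)

# Three-way confusion matrix
mnist345_preds |>
  conf_mat(truth = digit, estimate = .pred_class)
```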